Proxisch: An Optimization Approach of Large-Scale Unstable Proxy Servers Scheduling

Authors

  • Xiaoming Jiang
  • Jinqiao Shi
  • Qingfeng Tan
  • Wentao Zhang
  • Xuebin Wang
  • Muqian Chen
Abstract

Large companies such as Google and Microsoft, which have ample proxy servers, have implemented web crawlers that fetch a given website in parallel. Researchers, however, lack such expensive proxy servers, so crawling large amounts of information from a single website in parallel remains difficult for them. In this case, free public proxy servers collected from the Internet are a good choice. To improve the efficiency of a web crawler built on them, two issues must be considered first: (1) tasks may fail owing to the instability of free proxy servers; (2) a proxy server will be blocked if it visits a single website too frequently. In this paper, we propose Proxisch, an optimization approach to scheduling large-scale unstable proxy servers, which allows anyone to run a web crawler efficiently at extremely low cost. Proxisch is designed to work efficiently by making maximum use of reliable proxy servers. To solve the second problem, it establishes a frequency control mechanism that keeps the visiting frequency of any chosen proxy server below the website's limit. The results show that our approach performs better than the other scheduling algorithms.

Keywords: proxy server, priority queue, optimization approach, distributed web crawling.
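
The abstract names two ingredients, a preference for reliable proxies and a per-proxy frequency limit, but does not give the algorithm itself. The following is only a minimal sketch of how those two ideas could be combined with priority queues. The class `ProxyScheduler`, its methods, the smoothed success-rate score, and the fixed `min_interval` cooldown are all assumptions made for illustration and are not taken from the paper.

```python
import heapq


class ProxyScheduler:
    """Hypothetical sketch: reliability-ordered queue plus a per-proxy cooldown."""

    def __init__(self, proxies, min_interval):
        # Assumed frequency limit: minimum seconds between two uses of one proxy.
        self.min_interval = min_interval
        # proxy -> [successes, failures]
        self.stats = {p: [0, 0] for p in proxies}
        # Max-heap on reliability, implemented by negating the score.
        self.ready = [(-self._score(p), p) for p in proxies]
        heapq.heapify(self.ready)
        # Min-heap of (time the proxy may be used again, proxy).
        self.cooling = []

    def _score(self, proxy):
        ok, bad = self.stats[proxy]
        # Laplace-smoothed success rate, so an unused proxy starts at 0.5.
        return (ok + 1) / (ok + bad + 2)

    def acquire(self, now):
        """Return the most reliable proxy that is off cooldown, or None."""
        while self.cooling and self.cooling[0][0] <= now:
            _, proxy = heapq.heappop(self.cooling)
            heapq.heappush(self.ready, (-self._score(proxy), proxy))
        if not self.ready:
            return None
        _, proxy = heapq.heappop(self.ready)
        return proxy

    def release(self, proxy, succeeded, now):
        """Record the task outcome and start the proxy's cooldown."""
        self.stats[proxy][0 if succeeded else 1] += 1
        heapq.heappush(self.cooling, (now + self.min_interval, proxy))
```

In this sketch a proxy sits in neither heap while a task is using it, so its score changes only between queue entries and no stale heap entries arise. A caller would pair every `acquire(now)` with a later `release(proxy, succeeded, now)`, passing timestamps from whatever clock it uses.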

Similar resources

An Effective Task Scheduling Framework for Cloud Computing using NSGA-II

Cloud computing is a model for convenient on-demand user’s access to changeable and configurable computing resources such as networks, servers, storage, applications, and services with minimal management of resources and service provider interaction. Task scheduling is regarded as a fundamental issue in cloud computing which aims at distributing the load on the different resources of a distribu...

A New Approach in Job Shop Scheduling: Overlapping Operation

In this paper, a new approach to overlapping operations in job shop scheduling is presented. In many job shops, a customer demand can be met in more than one way for each job, where demand determines the quantity of each finished job ordered by a customer. In each job, embedded operations can be performed due to overlapping considerations in which each operation may be overlapped with the other...

Robust production scheduling in open-pit mining under uncertainty: a box counterpart approach

Open-Pit Production Scheduling (OPPS) problem focuses on determining a block sequencing and scheduling to maximize Net Present Value (NPV) of the venture under constraints. The scheduling model is critically sensitive to the economic value volatility of block, block weight, and operational capacity. In order to deal with the OPPS uncertainties, various approaches can be recommended. Robust opti...

An adaptive modified firefly algorithm to unit commitment problem for large-scale power systems

The unit commitment (UC) problem tries to schedule the output power of generation units to meet the system demand for the next several hours at minimum cost. UC adds a time dimension to the economic dispatch problem with the additional choice of turning generators on or off. In this paper, in order to improve both the exploitation and exploration abilities of the firefly algorithm (FA), a new mo...

A discrete-event optimization framework for mixed-speed train timetabling problem

Railway scheduling is a complex task for rail operators that involves generating a conflict-free train timetable. This paper presents a discrete-event simulation-based optimization approach for solving the train timetabling problem to minimize total weighted unplanned stop time in hybrid single- and double-track railway networks. The designed simulation model is used as a platform for ge...

Publication date: 2016